Parameterization of the Interlingua in Machine Translation
Abstract
The task of designing an interlingual machine translation system is difficult, first because the designer must have knowledge of the principles underlying cross-linguistic distinctions for the languages under consideration, and second because the designer must then be able to incorporate this knowledge effectively into the system. This paper provides a catalog of several types of distinctions among Spanish, English, and German, and describes a parametric approach that characterizes these distinctions, both at the syntactic level and at the lexical-semantic level. The approach described here is implemented in a system called UNITRAN, a machine translation system that translates English, Spanish, and German bidirectionally.

1 Introduction

What makes the task of designing an interlingual machine translation system difficult is the requirement that the translator process many types of language-specific phenomena while still maintaining language-independent information about the source and target languages. Given that these two types of knowledge (language-specific and language-independent) are required to fulfill the translation task, one approach to designing a machine translation system is to provide a common language-independent representation that acts as a pivot between the source and target languages, and to provide a parameterized mapping between this form and the input and output of each language. This is the approach taken in UNITRAN, a machine translation system that translates English, Spanish, and German bidirectionally.

Figure 1: Overall Design of the UNITRAN System

The pivot form that is used in this system is a lexical conceptual structure (henceforth, LCS) (see Jackendoff (1983, 1990), Hale & Laughren (1983), Hale & Keyser (1986a, 1986b), and Levin & Rappaport (1986)), which is a form that underlies the source- and target-language sentences. The pivot approach to translation is called interlingual because it relies on an underlying form derived from universal principles that hold across all languages. Within this framework, distinctions between languages are accounted for by settings of parameters associated with the universal principles. For example, there is a universal principle that requires there to be a conceptual subject for each predicate of a sentence. Whether or not the conceptual subject is syntactically realized is determined by a parameter associated with this principle: the null subject parameter. This parameter is set to yes for Spanish (also Italian, Hebrew, etc.) but no for English and German (also French, Warlpiri, etc.). The setting of the null subject parameter accounts for the possibility of a missing subject in Spanish and the incorrectness of a missing subject in English and German (except for the imperative form).

This paper argues that not only should the syntactic component of a machine translation system be parameterized, but other components of a machine translation system would also benefit from the parameterization approach. In particular, the lexical-semantic component must be constructed in such a way as to allow principles of the lexicon to be parameterized.

*This paper describes research done at the University of Maryland Institute for Advanced Computer Studies and at the MIT Artificial Intelligence Laboratory. Useful guidance and commentary during the research and preparation of this document were provided by Bob Berwick, Gary Coen, Bruce Dawson, Klaudia Dussa-Zieger, Terry Gaasterland, Ken Hale, Mike Kashket, Jorge Lobo, Paola Merlo, James Pustejovsky, Jeff Siskind, Clare Voss, Amy Weinberg, and Patrick Winston.
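The null subject parameter described above can be pictured as a per-language switch consulted by a single language-independent check. The following is only a minimal sketch: the dictionary, the function name, and the `imperative` flag are illustrative assumptions, not UNITRAN's actual data structures. Only the yes/no settings and the imperative exception come from the text.

```python
# Hypothetical parameter table. Only the yes/no values are taken from the text.
NULL_SUBJECT = {
    "spanish": True,   # "yes": the conceptual subject may be left unrealized
    "english": False,  # "no": an overt subject is required
    "german": False,   # "no": an overt subject is required
}


def subject_may_be_omitted(language, imperative=False):
    """Return True if a clause may surface without an overt subject.

    The principle (every predicate has a conceptual subject) is shared by
    all languages; only the parameter lookup is language-specific.
    """
    # The text names the imperative form as the exception for English and German.
    return NULL_SUBJECT[language] or imperative
```

Under these assumptions, adding a language amounts to adding one entry to the table; the check itself does not change.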
Thus, UNITRAN uses two levels of processing, syntactic and lexical-semantic, both of which operate on the basis of language-independent knowledge that is parameterized to encode language-specific information (see figure 1). Within the syntactic level, the language-independent and language-specific information are supplied, respectively, by the principles and parameters of government-binding theory (henceforth, GB) (see Chomsky (1981, 1982)). Within the lexical-semantic level, the language-independent and language-specific information are supplied by a set of general LCS mappings and the associated parameters for each language, respectively. The interface between the syntactic and semantic levels allows the source-language structure to be mapped systematically to the conceptual form, and it allows the target-language structure to be realized systematically from lexical items derived from the conceptual form.

Actes de COLING-92, Nantes, 23-28 août 1992   624   Proc. of COLING-92, Nantes, Aug. 23-28, 1992

This work represents a shift away from complex, language-specific syntactic translation without entirely abandoning syntax. Furthermore, the work moves toward a model that employs a well-defined lexical conceptual representation without requiring a "deep" semantic conceptualization. Consider the following example:

(1) (i) I stabbed John
    (ii) Yo le di puñaladas a Juan
         'I gave knife-wounds to John'

This example illustrates a type of distinction (henceforth called divergence, as presented in Dorr (1990a)) that arises in machine translation: the source-language predicate, stab, is mapped to more than one target-language word, dar puñaladas a. This divergence type is lexical in that there is a word selection variation between the source language and the target language. Such divergences are accounted for by lexical-semantic parameterization, as we will see in section 3.
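To make the lexical divergence in example (1) concrete, the sketch below routes both languages through one shared concept label and compares the words each language uses to realize it. This is a deliberately simplified illustration. The label `"STAB"`, the lexicon layout, and both function names are hypothetical, and a flat label stands in for a real LCS, whose internal structure is not shown in this section.

```python
# Hypothetical pivot lexicon: one concept label, one realization per language.
LEXICON = {
    "STAB": {
        "english": ["stab"],
        "spanish": ["dar", "puñaladas", "a"],
    },
}


def realize(concept, language):
    """Look up the words a language uses to express a pivot concept."""
    return LEXICON[concept][language]


def is_lexical_divergence(concept, source, target):
    """Simplified test: a single source word maps to several target words."""
    return len(realize(concept, source)) == 1 and len(realize(concept, target)) > 1
```

Here `is_lexical_divergence("STAB", "english", "spanish")` holds because stab is one word while dar puñaladas a is three. The word-count test is an assumption made for illustration, not the criterion the paper defines.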
The following section of this paper will provide a catalog of syntactic divergences between the source and target languages. The set of parameters that are used to account for these divergences will be described. In the third section, we will examine the divergences that occur at the lexical-semantic level, and we will see how the parametric approach accounts for these divergences as well. Finally, we will turn to the evaluation and coverage of the system.

2 Toward a Catalog of Syntactic Divergences

Figure 2 shows a diagram of the UNITRAN syntactic processing component. The parser of this component provides a source-language syntactic structure to the lexical-semantic processor, and, after lexical-semantic processing is completed, the generator of this component provides a target-language syntactic structure. Both the parser and generator of this component have access to the syntactic principles of GB theory. These principles, which act as constraints (i.e., filters) on the syntactic structures produced by the parser and the generator, operate on the basis of parameter settings that supply certain language-specific information; this is where syntactic divergences are factored out from the lexical-semantic representation.

Figure 2: Design of the Syntactic Processing Component

The GB principles and parameters are organized into modules whose constraints are applied in the following order: (1) X-bar, (2) Bounding, (3) Case, (4) Trace, (5) Binding, and (6) θ. A detailed description of these modules is provided in Dorr (1987). We will look briefly at a number of these, focusing on how syntactic divergences are accounted for by this approach. Figure 3 summarizes the syntactic divergences that are revealed by the parametric variations presented here.¹

2.1 Principles and Parameters of the X-bar Module

The X-bar constraint module of the syntactic component provides the phrase-structure representation of sentences.
In particular, the fundamental principle of the X-bar module is that each phrase of a sentence has a maximal projection, X-MAX, for a head of category X (see figure 4).² In addition to the head X, a phrasal projection potentially contains satellites α1, α2, β1, β2, γ1, and γ2, where α1 and α2 are any number of maximally adjoined adjuncts positioned according to the adjunction parameter, β1 and β2 are arguments (subjects and objects) ordered according to the constituent order parameter, and γ1 and γ2 are any number of minimally adjoined adjuncts positioned according to the adjunction parameter.³

¹ The syntactic divergences are enumerated with respect to the relevant parameters and modules of the syntactic component. The figure illustrates the effect of syntactic parameter settings on the constituent structure for each language. (In this figure, E stands for English, G for German, S for Spanish, and I for Icelandic.)

² The possibilities for the category X are: (V)erb, (N)oun, (A)djective, (P)reposition, (C)omplementizer, and (I)nflection. The Complementizer corresponds to relative pronouns such as that in the man that I saw. The Inflectional category corresponds to modals such as would in I would eat cake.

³ This is a revised version of the X-bar Theory presented in Chomsky (1981). The adjunction parameter will not be discussed here, but see Dorr (1987) for details.

Figure 3: Summary of Syntactic Divergences

Parameter: constituent order (GB module: X-bar)
    E, S: V precedes object
    G: V follows object

Parameter: proper governors (GB module: Government)
    E: P stranding allowed
    S, G: No P stranding allowed

Parameter: bounding nodes (GB module: Bounding)
    E, G: Fronted question word beyond single sentence level not allowed
    S: Fronted question word beyond single sentence level allowed

Parameter: type of government
    E, G: P not required before verbal object associated with clitic
    S: P required before verbal object associated with clitic

Parameter: null subject
    E, G: Subject required in matrix clause
    S: Subject not required in matrix clause

Parameter: governing category (GB module: Binding)
    E, S, G: Anaphor (e.g., himself) must have antecedent inside nearest dominating clause
    I: Anaphor (e.g., sig) may have antecedent outside nearest dominating clause

Parameter: NDP (GB module: θ)
    E: No empty pleonastics allowed
    S: Empty pleonastics allowed
    G: Empty pleonastics in embedded clauses only
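As an illustration of how one entry in Figure 3 could drive language-independent code, the sketch below linearizes a verb and its object according to the constituent order parameter. The table name, function name, and the abstract labels `"V"` and `"OBJ"` are assumptions made for this example. Only the settings themselves (verb before object in English and Spanish, verb after object in German) come from the figure.

```python
# Hypothetical encoding of the constituent order parameter from Figure 3.
VERB_PRECEDES_OBJECT = {
    "english": True,
    "spanish": True,
    "german": False,
}


def order_verb_phrase(verb, obj, language):
    """Order a head verb and its object according to the language's setting."""
    if VERB_PRECEDES_OBJECT[language]:
        return [verb, obj]
    return [obj, verb]
```

For example, `order_verb_phrase("V", "OBJ", "german")` places the object first, while the English and Spanish settings place the verb first. The sketch covers only the relative order of verb and object inside the phrase and makes no claim about other word-order effects.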